Reduce histogram metric cardinality for baseline resync #397

Merged
xiaoxichen merged 1 commit into eBay:main from xiaoxichen:reduce-histogram-cardinality
Mar 11, 2026

@xiaoxichen
Collaborator

Convert snapshot histogram metrics from full bucket distributions to sum/count publishing mode, reducing Prometheus cardinality by 82-93%.

Changes:

  • Convert 5 snapshot histograms to publish_as_sum_count:

    • snp_dnr_blob_process_latency (donor)
    • snp_dnr_batch_process_latency (donor)
    • snp_dnr_batch_e2e_latency (donor)
    • snp_rcvr_blob_process_time (receiver)
    • snp_rcvr_batch_process_time (receiver)
  • Update 4 GC histogram comments for accuracy:

    • PercentileBuckets: 128 → 10 buckets (10% increments)
    • LinearUpto64Buckets: Clarify 17 buckets covering 0-64s
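To make the two bucket descriptions above concrete, here is a minimal sketch of evenly spaced upper bounds. It assumes that "10 buckets (10% increments)" means bounds at 10, 20, ..., 100 and that "17 buckets covering 0-64s" means 17 evenly spaced bounds from 0 to 64; the actual bucket definitions in the codebase may differ, and `linear_bounds` is a hypothetical helper, not a library function.

```python
def linear_bounds(first: float, last: float, count: int) -> list[float]:
    """Return `count` evenly spaced bucket upper bounds from `first` to `last`."""
    step = (last - first) / (count - 1)
    return [first + i * step for i in range(count)]


# Assumed layout for a percentage histogram: 10 bounds, 10% apart.
percent_bounds = linear_bounds(10, 100, 10)

# Assumed layout for a 0-64 s histogram: 17 bounds, which implies 4 s steps.
seconds_bounds = linear_bounds(0, 64, 17)

print(percent_bounds)
print(seconds_bounds)
```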

Full histogram data is still collected locally for the JSON API. Design doc: docs/plans/2026-03-11-reduce-histogram-cardinality-design.md
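The cardinality saving can be estimated with some simple arithmetic. The sketch below assumes the standard Prometheus histogram exposition (one `_bucket` series per finite bucket, one for `+Inf`, plus `_sum` and `_count`) and assumes that sum/count mode exposes exactly two series per label set. The bucket counts are hypothetical, since this PR does not state the bucket counts of the snapshot histograms.

```python
SUM_COUNT_SERIES = 2  # assumption: sum/count mode exposes only a sum and a count


def histogram_series(finite_buckets: int) -> int:
    """Series per label set for a full histogram: finite buckets, +Inf, _sum, _count."""
    return finite_buckets + 1 + 2


def reduction(finite_buckets: int) -> float:
    """Fraction of series removed by switching a histogram to sum/count mode."""
    return 1 - SUM_COUNT_SERIES / histogram_series(finite_buckets)


for buckets in (8, 17, 26):  # hypothetical bucket counts
    print(f"{buckets} buckets: {histogram_series(buckets)} -> {SUM_COUNT_SERIES} series, "
          f"{reduction(buckets):.0%} fewer")
```

Under this model, a reduction of 82-93% corresponds to histograms with roughly 8 to 26 finite buckets.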

@xiaoxichen xiaoxichen force-pushed the reduce-histogram-cardinality branch from 1151615 to 38b2946 on March 11, 2026 08:29
Contributor

@yuwmao yuwmao left a comment

Looks good in general. The snapshot metrics are removed once the snapshot completes, so they won't account for much.

@xiaoxichen xiaoxichen force-pushed the reduce-histogram-cardinality branch 2 times, most recently from 0b06068 to cc0ab15 on March 11, 2026 08:55
Convert snapshot histogram metrics from full bucket distributions to
sum/count publishing mode

Changes:
- Convert 4 snapshot histograms to publish_as_sum_count:
  - snp_dnr_blob_process_latency (donor)
  - snp_dnr_batch_process_latency (donor)
  - snp_rcvr_blob_process_time (receiver)
  - snp_rcvr_batch_process_time (receiver)

- snapshot batch e2e remains as histogram
  - snp_dnr_batch_e2e_latency (donor)

- Update 4 GC histogram comments for accuracy:
  - PercentileBuckets: 128 → 10 buckets (10% increments)
  - LinearUpto64Buckets: Clarify 17 buckets covering 0-64s

Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
@xiaoxichen xiaoxichen force-pushed the reduce-histogram-cardinality branch from cc0ab15 to 394bafb on March 11, 2026 08:56
@xiaoxichen
Collaborator Author

Looks good in general. The snapshot metrics are removed once the snapshot completes, so they won't account for much.

Yes, leaving the E2E as a histogram.
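As a rough illustration of what sum/count mode retains and what it drops, the sketch below computes a mean latency from two cumulative (sum, count) samples and shows that two very different latency distributions can produce the same mean. The `SumCount` type, the `mean_between` helper, and the sample values are all hypothetical and are not taken from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SumCount:
    total: float  # cumulative sum of observed values, e.g. microseconds
    count: int    # cumulative number of observations


def mean_between(earlier: SumCount, later: SumCount) -> Optional[float]:
    """Average observation between two scrapes, or None if nothing was observed."""
    observed = later.count - earlier.count
    if observed <= 0:
        return None
    return (later.total - earlier.total) / observed


# Two made-up workloads with the same sum and count but different tails.
steady = [100.0] * 10
spiky = [10.0] * 9 + [910.0]

start = SumCount(0.0, 0)
steady_mean = mean_between(start, SumCount(sum(steady), len(steady)))
spiky_mean = mean_between(start, SumCount(sum(spiky), len(spiky)))

# Both means are equal even though the slowest observation differs by 9x.
print(steady_mean, spiky_mean, max(steady), max(spiky))
```

A mean is enough to track average processing cost, but tail latency and percentiles can only be recovered from a metric that keeps its buckets.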

Collaborator

@JacksonYao287 JacksonYao287 left a comment

The GC metrics-related changes look good to me.

Contributor

@Besroy Besroy left a comment

LGTM

@xiaoxichen xiaoxichen merged commit a5baea7 into eBay:main Mar 11, 2026
65 of 73 checks passed
4 participants